fix: keep large LOAD external scan multi-CN by aptend · Pull Request #24855 · matrixorigin/matrixone

aptend · 2026-06-04T15:19:02Z

What type of PR is this?

Which issue(s) this PR fixes:

What this PR does / why we need it:

This fixes LOAD external scan stats so large LOAD jobs keep row/cardinality semantics for Cost, Outcnt, TableCnt, and BlockNum, while preserving Cost * Rowsize as the input-size hint used by external scan parallel sizing.

Previously the LOAD stats used input bytes as Cost with Rowsize=1 and forced BlockNum=1 / TableCnt=1. That can make large CSV/TBL LOAD choose AP_ONECN instead of the expected multi-CN AP path.

Tests:

CGO_CFLAGS="-I$(pwd)/cgo -I$(pwd)/thirdparties/install/include" go test ./pkg/sql/plan -count=1

qodo-code-review · 2026-06-04T15:19:07Z

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

XuPeng-SH · 2026-06-07T15:18:01Z

再补一个实质问题：

现在 makeLoadExternalStats 用的是 GetRowSizeFromTableDef(tableDef, true) * 0.8 来估算 file-based LOAD 的 rowSize，但在 LOAD DATA ... (col, @var, ...) 这条路径里，传进去的 tableDef 已经在 getExternalWithColListProject / bind_load.go 里被裁成了只包含真实目标列的 schema，而 @var 这种输入字段不会出现在这个 tableDef 里。

结果就是：实际输入行里明明还有额外字段，新的统计却只按目标列宽来算，rowSize 会被明显低估，进而把 rowCount / Cost / Outcnt / BlockNum 系统性放大。这个路径不是理论上的，仓库里现有用例就有：test/distributed/cases/load_data/load_data.sql 里多处都在测 (@col1, col2) / (col1, @col2)。

对这类 LOAD，执行路径阈值会被更早打穿，可能把本来不该进 multi-CN 的任务提前推到 multi-CN。也就是说，这次改动把“按真实输入行宽估算”的目标，在带 @var 的 column list 场景下又重新打偏了。

XuPeng-SH · 2026-06-07T15:38:16Z

看了最新更新后，前面 BlockNum 的 off-by-one 和 inline CSV 少算换行这两个点已经修掉了，但我还看到两个残留问题：

IGNORE n LINES 的首行采样可能仍然读到 header
现在 estimateLoadRowsizeFromFirstLine 会按传入的 offset 去读首行做 rowSize 估算，但 offset = IgnoredLines(...) 只在 stmt.Param.Tail.IgnoredLines > 0 && stmt.Param.Parallel && noCompress && !stmt.Param.Local 时才会计算。也就是说，只要这条 LOAD 一开始没有开 Parallel，即使声明了 IGNORE 1 LINES，rowSize 采样仍然会从文件头开始，拿到 header 而不是首条数据行。

这里的 stats 仍然会影响 GetExecType() 和 Cost * Rowsize 的并行度估算，所以这不是纯测试问题。建议把 header offset 的计算从 Parallel 条件里解耦，或者让采样逻辑自己跳过 ignored lines。
文件被不必要地排除在首行采样之外
estimateLoadRowsizeFromFirstLine 现在要求 loadLinesTerminatedBy(param) == "\n"，这会把显式 LINES TERMINATED BY \r\n 的 CSV 直接打回 schema fallback。\n\n 但 readExternalFirstLineSize 本身就是按 ReadString(n) 读的，天然可以把 \r\n 一起算进来；而且仓库里现有 LOAD DATA case 也明确支持 \r\n。所以这条 gate 会让一类常见文本文件继续落在偏差更大的 schema 估算上，和这次 PR 想修正的方向不一致。

XuPeng-SH

最新更新把我前面关注的点都补齐了：

BlockNum 边界的 off-by-one 改成了真正的 ceil；
inline CSV 的行宽估算现在会把 terminator 算进去；
file-based text load 新增了首行采样，并且已经补上了 IGNORE LINES / \r\n 这些我上轮卡住的残留点；
相关单测也跟着补到了边界和采样分支。

这版我这边看下来可以过。

aptend requested review from aunjgr and ouyuanning as code owners June 4, 2026 15:19

Copilot AI review requested due to automatic review settings June 4, 2026 15:19

Copilot started reviewing on behalf of aptend June 4, 2026 15:19 View session

aptend had a problem deploying to ci June 4, 2026 15:19 — with GitHub Actions Error

aptend temporarily deployed to ci June 4, 2026 15:19 — with GitHub Actions Inactive

aptend had a problem deploying to ci June 4, 2026 15:19 — with GitHub Actions Error

aptend temporarily deployed to ci June 4, 2026 15:19 — with GitHub Actions Inactive

aptend had a problem deploying to ci June 4, 2026 15:19 — with GitHub Actions Error

matrix-meow added the size/M Denotes a PR that changes [100,499] lines label Jun 4, 2026

aptend marked this pull request as draft June 4, 2026 15:19

mergify Bot added the kind/bug Something isn't working label Jun 4, 2026

mergify Bot temporarily deployed to ci June 4, 2026 15:20 Inactive

aptend had a problem deploying to ci June 7, 2026 15:16 — with GitHub Actions Error

aptend temporarily deployed to ci June 7, 2026 15:16 — with GitHub Actions Inactive

aptend had a problem deploying to ci June 7, 2026 15:16 — with GitHub Actions Error

aptend temporarily deployed to ci June 7, 2026 15:16 — with GitHub Actions Inactive

aptend force-pushed the fix-load-external-stats-multicn branch from 3308b8e to 1641216 Compare June 7, 2026 15:47

aptend added 2 commits June 7, 2026 23:49

fix: keep large load external scan multi-cn

b2adeb9

fix: refine load external stats estimates

6ea83f8

aptend had a problem deploying to ci June 7, 2026 15:50 — with GitHub Actions Error

aptend temporarily deployed to ci June 7, 2026 15:50 — with GitHub Actions Inactive

aptend had a problem deploying to ci June 7, 2026 15:50 — with GitHub Actions Error

aptend temporarily deployed to ci June 7, 2026 15:50 — with GitHub Actions Inactive

aptend had a problem deploying to ci June 7, 2026 15:50 — with GitHub Actions Error

aptend force-pushed the fix-load-external-stats-multicn branch from 1641216 to 6ea83f8 Compare June 7, 2026 15:50

aptend temporarily deployed to ci June 7, 2026 15:50 — with GitHub Actions Inactive

iamlinjunhong approved these changes Jun 7, 2026

View reviewed changes

XuPeng-SH approved these changes Jun 7, 2026

View reviewed changes

aptend mentioned this pull request Jun 8, 2026

fix: bound LOAD first-line row size sampling #24878

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: keep large LOAD external scan multi-CN#24855

fix: keep large LOAD external scan multi-CN#24855
heni02 merged 2 commits into
matrixorigin:mainfrom
aptend:fix-load-external-stats-multicn

aptend commented Jun 4, 2026 •

edited

Loading

Uh oh!

qodo-code-review Bot commented Jun 4, 2026

Uh oh!

XuPeng-SH commented Jun 7, 2026

Uh oh!

XuPeng-SH commented Jun 7, 2026

Uh oh!

XuPeng-SH left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

aptend commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

Which issue(s) this PR fixes:

What this PR does / why we need it:

Uh oh!

qodo-code-review Bot commented Jun 4, 2026

Qodo reviews are paused for this user.

Uh oh!

XuPeng-SH commented Jun 7, 2026

Uh oh!

XuPeng-SH commented Jun 7, 2026

Uh oh!

XuPeng-SH left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

aptend commented Jun 4, 2026 •

edited

Loading